Twitter text normalization based on unsupervised learning algorithm
DENG Jiayuan, JI Donghong, FEI Chaoqun, REN Yafeng
Journal of Computer Applications
2016, 36 (7): 1887-1892.
DOI: 10.11772/j.issn.1001-9081.2016.07.1887
Twitter messages contain a large number of nonstandard tokens, created either unintentionally or intentionally by users. Normalizing these tokens is crucial for many natural language processing applications. To address the poor performance of existing normalization systems, a novel unsupervised normalization system was proposed. First, a standard dictionary was used to determine whether a tweet needs to be normalized. Second, each nonstandard token was assigned to 1-to-1 or 1-to-N recovery according to its characteristics. For 1-to-N recovery, the nonstandard token was divided into multiple possible words using forward and backward search. Third, normalization candidates were generated for the nonstandard tokens among these possible words by integrating random walk and a spelling checker. Finally, the best normalized tweet was obtained by scoring all the candidates with an n-gram language model. Experimental results on a manually annotated dataset show that the proposed approach obtains an F-score of 86.4%, which is 10 percentage points higher than that of the current best graph-based random walk algorithm.
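The abstract does not spell out how the forward and backward search works, so the following is only a minimal sketch of one common reading: greedy longest-match segmentation run left-to-right and right-to-left against a dictionary. The function names, the `max_len` limit, and the toy `VOCAB` set are all assumptions made for illustration, not the authors' implementation.

```python
def forward_split(token, vocab, max_len=10):
    """Greedy left-to-right longest match; None if any part is out of vocabulary."""
    words, i = [], 0
    while i < len(token):
        # Try the longest substring starting at i first, then shorter ones.
        for j in range(min(len(token), i + max_len), i, -1):
            if token[i:j] in vocab:
                words.append(token[i:j])
                i = j
                break
        else:
            return None  # no dictionary word starts at position i
    return words


def backward_split(token, vocab, max_len=10):
    """Greedy right-to-left longest match; None if any part is out of vocabulary."""
    words, j = [], len(token)
    while j > 0:
        # Try the longest substring ending at j first, then shorter ones.
        for i in range(max(0, j - max_len), j):
            if token[i:j] in vocab:
                words.insert(0, token[i:j])
                j = i
                break
        else:
            return None  # no dictionary word ends at position j
    return words


def candidate_splits(token, vocab):
    """Collect the distinct multi-word splits found by the two searches."""
    splits = []
    for split in (forward_split(token, vocab), backward_split(token, vocab)):
        if split is not None and len(split) > 1 and split not in splits:
            splits.append(split)
    return splits


# Hypothetical toy dictionary; a real system would load a standard word list.
VOCAB = {"a", "i", "love", "you", "no", "now", "here", "where"}

print(candidate_splits("iloveyou", VOCAB))
print(candidate_splits("nowhere", VOCAB))
```

With this toy dictionary the two directions agree on "iloveyou" but disagree on "nowhere" (now + here versus no + where), which illustrates why a later step, such as the n-gram language model mentioned above, would be needed to choose among the candidate splits.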